In [ ]:
import librosa
import librosa.display  # needed for specshow in some librosa versions
import numpy as np
from IPython.display import Audio
import matplotlib.pyplot as plt
import holoviews as hv
hv.extension('bokeh')
file_names = ['audio/03-01-01-01-01-02-01.wav',
              'audio/20 - 20,000 Hz Audio Sweep Range of Human Hearing.mp3', 
              'audio/videoplayback.mp3']

Time-Series Data

The loaded data is represented as audio_data, which is depicted in terms of time and magnitude.
Does this mean that all the information about the sound we hear is encompassed in this representation?
Is what we hear merely the magnitude of these sounds?
Although this representation appears simple, it can be quite challenging and unintuitive to work with.

Here are some issues encountered:

  1. The interpretation, while straightforward, is not intuitive. For instance, what does a pattern of high amplitude followed by a sudden drop signify? What does a gradual increase in amplitude imply?
  2. Even a slight time shift can lead to significantly different analyses.
  3. What we see here is amplitude. How do we perceive sound? It's not immediately clear how these amplitude variations translate into the sounds and tones we perceive and understand.
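Issue 2 can be seen with a tiny NumPy experiment (a sketch on a synthetic 440 Hz tone, not on the files above): shifting the signal by about a millisecond makes a sample-wise comparison blow up, even though the magnitude spectrum, and the sound we hear, is unchanged.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
shifted = np.roll(tone, 25)            # ~1 ms circular shift (about half a period)

# Sample-wise, the two signals now disagree almost everywhere...
mse = np.mean((tone - shifted) ** 2)

# ...but a time shift only changes phase: the magnitude spectrum is identical.
same_spectrum = np.allclose(np.abs(np.fft.rfft(tone)),
                            np.abs(np.fft.rfft(shifted)))
```

This is one reason frequency-domain representations, introduced below, are easier to work with.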
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    # One time value per sample, in seconds (times_like assumes frame indices,
    # so build the axis directly from the sample count)
    time = np.arange(len(audio_data)) / sample_rate
    plot = hv.Curve((time, audio_data)).opts(width=1100, height=400, title="Waveform: " + file_name)

    display(plot)
    display(Audio(data=audio_data, rate=sample_rate, autoplay=False))

A little bit of time and a little bit of frequency

We can better understand how humans hear through frequency analysis.
We can analyze "tones", "timbre", "pitch", "octaves", and other such representations, making the frequency domain a more intuitive one to work with.

However, naively applying the Discrete Fourier Transform (DFT) to convert time-domain data into the frequency domain can result in a loss of important time-related information.
Note: recall that when applying the Fourier transform, the time variable $t$ disappears.

Short-Time Fourier Transform (STFT)

The Short-Time Fourier Transform (STFT) applies a window function to the signal in the time domain and then computes the Fourier transform (via the FFT) within each windowed segment.
Windows usually overlap with each other. This overlap ensures continuity between adjacent segments and helps avoid artifacts at the window boundaries.
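To make the windowing concrete, here is a NumPy-only sketch of what an STFT computes. The `stft` helper here is hypothetical and ignores the padding/centering that `librosa.stft` performs, so the frame count differs from librosa's; the idea (window, hop, FFT per frame) is the same.

```python
import numpy as np

def stft(signal, n_fft=1024, hop=256):
    # Hann-window overlapping frames, then take the real FFT of each frame.
    # (librosa.stft additionally pads/centers the signal; this sketch does not.)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # shape: (n_fft // 2 + 1, n_frames)

# One second of a 440 Hz tone at 22050 Hz: its energy should sit near
# bin 440 / (22050 / 1024) ~= 20, in every frame.
sr = 22050
t = np.arange(sr) / sr
S = stft(np.sin(2 * np.pi * 440.0 * t))
peak_bin = int(np.abs(S).mean(axis=1).argmax())
```

Because each frame keeps its own time stamp, the STFT retains the time information that a single global DFT would discard.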


In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    stft = librosa.stft(audio_data)

    plt.figure(figsize=(16, 5))
    # stft is complex-valued, so plot its magnitude explicitly
    librosa.display.specshow(np.abs(stft), sr=sample_rate, x_axis='time', y_axis='log')
    plt.colorbar()
    plt.title('Magnitude of STFT')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Spectrogram

  • the STFT magnitude, prepared for visualization
  • a graphical representation of a signal
  • shows the amplitude of each frequency over time
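The dB conversion used in the next cell is essentially log compression. A rough sketch of what `librosa.amplitude_to_db` computes (simplified; the real function handles the reference and clipping a bit more carefully):

```python
import numpy as np

def amplitude_to_db(mag, ref=None, amin=1e-5, top_db=80.0):
    # Sketch of librosa.amplitude_to_db: log-compress the magnitudes,
    # with 0 dB pinned to the reference (here, the loudest bin).
    ref = np.max(mag) if ref is None else ref
    db = 20.0 * np.log10(np.maximum(mag, amin) / ref)
    return np.maximum(db, db.max() - top_db)  # limit the dynamic range

db = amplitude_to_db(np.array([1.0, 0.1, 0.001]))  # 0, -20, -60 dB
```

Passing `ref=np.max`, as below, makes the loudest component 0 dB and everything else negative, which spreads the color map over the full dynamic range.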
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    stft = librosa.stft(audio_data)

    magnitude_stft = np.abs(stft)  # discard phase information
    db = librosa.amplitude_to_db(magnitude_stft, ref=np.max)  # reference the peak, so 0 dB is the loudest bin

    # Plot
    plt.figure(figsize=(16, 5))
    librosa.display.specshow(db, sr=sample_rate, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Spectrogram')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Mel Spectrogram

  • emphasizes low frequencies.
  • approximates human hearing perception.
  • a spectrogram and a mel spectrogram represent the same information, but the mel spectrogram expresses frequency on the mel scale, which is more aligned with how humans hear.
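The mel scale itself is just a formula. A sketch using the HTK-style convention (note this is one common variant; librosa defaults to the slightly different Slaney formulation, so treat the exact numbers as approximate):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula: roughly linear below ~1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal mel steps devote far more resolution to low frequencies:
low = hz_to_mel(1000.0) - hz_to_mel(0.0)      # the first 1 kHz spans ~1000 mels
high = hz_to_mel(9000.0) - hz_to_mel(8000.0)  # the same 1 kHz span up high is much smaller
```

This is why the mel spectrogram "focuses" on low frequencies: the mel filter bank places many narrow filters at the bottom of the spectrum and few wide ones at the top.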
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    S = librosa.feature.melspectrogram(y=audio_data, sr=sample_rate)
    log_S = librosa.power_to_db(S, ref=np.max)  # mel spectrogram is a power spectrum, so use power_to_db

    plt.figure(figsize=(16, 5))
    librosa.display.specshow(log_S, sr=sample_rate, x_axis='time', y_axis='mel')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Mel spectrogram')
    plt.tight_layout()
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Spectrogram vs Mel Spectrogram

  • They represent the same information, just on different frequency scales.
  • Spectrograms are often used for visualization because they provide a linear, precise representation of the frequency content over time, which makes them more intuitive for human interpretation.
  • Mel spectrograms are preferred in machine learning because their representation is more efficient in terms of computational resources and feature relevance.
  • The mel scale emphasizes lower frequencies, where most of the information in human speech lies.
  • Tasks like genre classification, mood detection, or artist identification often benefit from the human-auditory-aligned representation of mel spectrograms.
  • Identifying specific sounds or events within an audio file (like glass breaking or applause) is often more effective with mel spectrograms.

  • These points hold in general, but not absolutely: some people may prefer a certain representation, and likewise, some models may work better on a particular representation.

Other representations

Chroma features (aka Pitch Class Profiles (PCP))

src: https://s18798.pcdn.co/jpbello/wp-content/uploads/sites/1691/2018/01/6-tonality.pdf

  • Terminology:
    • Octave: An octave in music is an interval between one musical pitch and another with double its frequency. This means that when you move up an octave, you are doubling the frequency of the original pitch. For example, if you start with the note A at 440 Hz, the next A in the higher octave would be at 880 Hz.
    • Pitches: Pitches in music refer to specific musical notes. In Western music, there are 12 distinct pitches in each octave, which include notes like A, A#, B, C, C#, D, D#, E, F, F#, G, and G#. Each of these pitches corresponds to a specific frequency.

The pitch helix: Models the special relationship that exists between octave intervals.
Height: naturally organizes pitches from low to high
Chroma: represents the inherent circularity of pitch organization

pitch_helix

!!! When the frequency of a musical note is doubled, it is perceived by our ears as being the same pitch, but in a higher octave.

  • Chroma features are used to represent the presence or absence of these 12 pitch classes in a piece of music, regardless of their octave.


https://youtu.be/WAnGsp9wajk?t=58

  • based on Western music theory's 12 pitch classes.
  • well suited to Western instruments.
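The octave-equivalence idea can be sketched in a few lines: map a frequency to one of the 12 pitch classes and throw away the octave. (`pitch_class` is a hypothetical helper; it assumes equal temperament with A4 = 440 Hz.)

```python
import numpy as np

PITCH_CLASSES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']

def pitch_class(freq_hz, ref=440.0):
    # Count semitones from A4 in equal temperament, then wrap modulo 12,
    # deliberately discarding the octave -- that is the chroma idea.
    semitones = int(round(12 * np.log2(freq_hz / ref)))
    return PITCH_CLASSES[semitones % 12]

# 440 Hz and 880 Hz are an octave apart but share the same pitch class, 'A',
# while middle C (~261.63 Hz) maps to 'C'.
```

A chroma feature vector does the same folding for all the energy in a frame: 12 bins, one per pitch class, regardless of octave.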
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    chroma = librosa.feature.chroma_stft(y=audio_data, sr=sample_rate)

    plt.figure(figsize=(16, 5))
    librosa.display.specshow(chroma, x_axis='time', y_axis='chroma')
    plt.colorbar()
    plt.title('Chromagram')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Cepstral Analysis

  • primarily used as a preprocessing step in machine learning pipelines
  • https://medium.com/@abdulsalamelelu/mfcc-the-dummys-guide-fd7fc471db76

  • Turns convolution into addition.
  • Assumption: the original signal was produced by convolution (e.g. a source convolved with a filter), which is generally valid for audio such as speech.
    cepstral_formula
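The "convolution to addition" step can be checked numerically: convolution in the time domain is multiplication in the frequency domain, and taking the log turns that product into a sum. A toy sketch, with random signals standing in for a source and a filter:

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.standard_normal(64)   # stand-in for an excitation signal
filt = rng.standard_normal(16)     # stand-in for a filter (e.g. vocal tract)

mixed = np.convolve(source, filt)  # time domain: convolution
n = len(mixed)

# In the frequency domain the convolution became multiplication...
X = np.fft.rfft(mixed)
prod = np.fft.rfft(source, n) * np.fft.rfft(filt, n)   # equals X

# ...and taking the log turns that product into a sum, which is the
# separation that cepstral analysis exploits.
log_mixed = np.log(np.abs(prod))
log_sum = (np.log(np.abs(np.fft.rfft(source, n)))
           + np.log(np.abs(np.fft.rfft(filt, n))))
```

Once source and filter add rather than convolve, linear techniques can separate them, which is the motivation behind the MFCC pipeline below.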

Mel-Frequency Cepstral Coefficients (MFCC)

src: https://www.researchgate.net/figure/The-main-steps-for-calculating-MFCC_fig5_266895811

mfcc steps

windowing: take a small slice of time and smooth its edges to reduce boundary artifacts.
windowing

Mel Frequency Wrapping: emphasize low frequencies.
mel freq wrapping

  • a "spectrum of the log spectrum": the cepstrum takes a transform of the log of the spectrum.
  • not well suited for visualization.
  • has good mathematical properties (compact, largely decorrelated coefficients), hence often used in machine learning.
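The last step of the pipeline, a DCT of the log mel energies, can be sketched directly. The `dct2` helper below is a hypothetical toy, and the input is random noise standing in for real log mel energies:

```python
import numpy as np

def dct2(x):
    # Orthonormal DCT-II along axis 0 -- the final MFCC step; equivalent to
    # scipy.fft.dct(x, type=2, norm='ortho', axis=0).
    n = x.shape[0]
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * (np.arange(n)[None, :] + 0.5) * k / n)
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

# Stand-in for real log mel energies: 40 mel bands x 10 frames of noise.
log_mel = np.log(np.random.default_rng(1).random((40, 10)) + 1e-3)
mfcc = dct2(log_mel)[:13]   # keep the first 13 coefficients, a typical choice
```

Keeping only the first few coefficients smooths away fine spectral detail and decorrelates the features, which is why small `n_mfcc` values like 13 or 20 are common in practice.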
In [ ]:
n_mfccs = [5, 13, 20, 50]

for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    mfccs = [librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=n_mfcc) for n_mfcc in n_mfccs]

    fig, axs = plt.subplots(2, 2, figsize=(14, 6))
    for ax, mfcc, n_mfcc in zip(axs.ravel(), mfccs, n_mfccs):
        librosa.display.specshow(mfcc, x_axis='time', ax=ax)
        ax.set_title(f'n_mfcc: {n_mfcc}')

    plt.suptitle('MFCC plots with different numbers of coefficients')
    plt.tight_layout()
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Thanks